How to exploit data on Pangeo#

Pangeo Workshop - Snow and Cloud Cover#

converted from EO-College/cubes-and-clouds

original author: Michele Claus @clausmichele; conversion by: Pangeo volunteers (Pier Lorenzo Marasco @pl-marasco, Alejandro Coca-Castro @acocac, Anne Fouilloux @annefou, Justus Magin @keewis, Tina Odaka @tinaodaka)

Introduction#

In this exercise, we will build the same complete EO workflow as in the OpenEO exercise, but using cloud-hosted data (a STAC catalogue) and processing it locally; from data access to obtaining the result.

We are going to follow these steps in our analysis:

  • Load satellite collections

  • Specify the spatial, temporal extents and the features we are interested in

  • Process the satellite data to retrieve snow cover information

  • Aggregate information in data cubes

  • Visualize and analyse the results

Important Info

More information on Pangeo can be found here: https://pangeo.io/

More information on the STAC specification can be found here: https://stacspec.org/

Import libraries#

# Data Manipulation and Analysis Libraries
import pandas as pd  
import numpy as np 

# Geospatial Data Handling Libraries
import geopandas as gpd 
from shapely.geometry import mapping  
import pyproj

# Multidimensional and Satellite Data Libraries
import xarray as xr 
import rioxarray as rio
import stackstac

# Data Visualization Libraries
import holoviews as hv
import hvplot.xarray
import hvplot.pandas

# Data parallelization and distributed computing libraries
import dask
from dask.distributed import Client, progress, LocalCluster

# STAC Catalogue Libraries
import pystac_client

Here we create a Dask client, which is essential for managing and executing parallel computations efficiently in the subsequent parts of the notebook. There are situations where you can connect to a Dask Gateway, but for this exercise we will use a local cluster.

cluster = LocalCluster()
client = Client(cluster)

The client's dashboard address can be copied and pasted into a browser to monitor the progress of the computations.

client

Client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

We will use the catchment as our area of interest (AOI) for the analysis. The catchment is defined by a polygon, which we will load from a GeoJSON file. The GeoJSON file contains the geometry of the catchment in the WGS84 coordinate reference system (EPSG:4326), which we specify explicitly when loading the file.

aoi = gpd.read_file('../data/catchment_outline.geojson', crs="EPSG:4326")
aoi_geojson = mapping(aoi.iloc[0].geometry)

Load satellite collections#

We will utilize the STAC API to search for satellite data in this exercise, specifically leveraging the API provided by AWS/Element84. The STAC API operates as a RESTful service, enabling the querying of satellite data with various filters such as spatial range, time period, and other specific metadata. This API is constructed based on the STAC specification, a collaborative, community-driven standard aimed at enhancing the discoverability and usability of satellite data. Numerous data providers, including AWS, Google Earth Engine, and Planet (the Copernicus Data Space Ecosystem (CDSE) is coming soon **), among others, have implemented the STAC API, exemplifying its widespread adoption and utility in accessing diverse satellite datasets.

We will limit the search to the Sentinel-2 L2A collection, which contains Sentinel-2 data processed to surface reflectance (Bottom-Of-Atmosphere). We will also limit the search to the time period between 1st February 2019 and 10th June 2019 and to the extent of the catchment.

** At the moment of writing, the STAC catalog of the CDSE is not yet fully operational.

URL = "https://earth-search.aws.element84.com/v1"
catalog = pystac_client.Client.open(URL)
items = catalog.search(
    intersects=aoi_geojson,
    collections=["sentinel-2-l2a"],
    datetime="2019-02-01/2019-06-10"
).item_collection()
len(items)
101
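The search above returns 101 items. If you wanted to thin the results by cloudiness, the logic is a simple filter over the items' metadata, as in this sketch (plain dictionaries stand in for real pystac Items here; the property name `eo:cloud_cover` comes from the STAC EO extension):

```python
# A sketch of post-filtering STAC items by cloud cover. Mock dictionaries
# mimic the structure of pystac Items (id + properties).
mock_items = [
    {"id": "S2A_1", "properties": {"eo:cloud_cover": 12.5}},
    {"id": "S2B_2", "properties": {"eo:cloud_cover": 67.0}},
    {"id": "S2A_3", "properties": {"eo:cloud_cover": 3.1}},
]

def filter_by_cloud_cover(items, max_cover=25.0):
    """Keep only items whose cloud cover is below the threshold."""
    return [it for it in items if it["properties"]["eo:cloud_cover"] < max_cover]

clear_items = filter_by_cloud_cover(mock_items)
print([it["id"] for it in clear_items])  # ['S2A_1', 'S2A_3']
```

Note that pystac-client can also push such a filter to the server via the `query` argument of `catalog.search`, which avoids downloading metadata for items you will discard anyway.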

Get bands information#

As the data provider exposes bands under different names than the original Sentinel-2 band names, we need to get the information about the bands.

# Get bands information from one of the returned items
selected_item = items[1]
for key, asset in selected_item.assets.items():
    print(f"{key}: {asset.title}")

Load data#

We will use the stackstac library to load the data; stackstac loads data from a STAC API into an xarray object. Here we will load the green and swir16 bands, which we will use to calculate the snow cover, together with the scl band (the scene classification layer), which we will use to mask out clouds. A spatial resolution of 20 m is selected for the analysis. The data is loaded in chunks of 2048x2048 pixels; we will talk more about chunking later.

ds = stackstac.stack(items,
                    bounds_latlon=aoi.iloc[0].geometry.bounds,
                    resolution=20,
                    chunksize=2048,
                    assets=['green', 'swir16', 'scl'])

Calculate snow cover#

We will calculate the Normalized Difference Snow Index (NDSI) to derive the snow cover. The NDSI is calculated as the difference between the green and the swir16 bands divided by the sum of the green and the swir16 bands. For clarity we will first define the green, swir16 and scl bands as variables. Other approaches can be used to manage the data, but this is the one we will use in this exercise.

green = ds.sel(band='green')
swir = ds.sel(band='swir16')
scl = ds.sel(band='scl')

We will now calculate the NDSI. Afterwards we will use the scene classification layer (scl), which contains information about the type of surface in each pixel, to mask out the clouds.

ndsi = (green - swir) / (green + swir)
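To see what the NDSI formula does numerically, here is a quick check on a few synthetic reflectance values (the values are made up for illustration): snow reflects strongly in the green band and weakly in the shortwave infrared, so snowy pixels yield high positive NDSI.

```python
import numpy as np

# Synthetic green and SWIR reflectances for three pixels:
# snow-like, mixed, and bare/vegetated.
green_px = np.array([0.6, 0.3, 0.2])
swir_px = np.array([0.2, 0.3, 0.6])

# NDSI = (green - swir) / (green + swir)
ndsi_px = (green_px - swir_px) / (green_px + swir_px)
print(ndsi_px)  # [ 0.5  0.  -0.5]
```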

Dask allows us to persist the data in memory, which is useful to speed up the computation. The persist method triggers the computation and keeps the result in memory until the end of the analysis. More explanation about persist, load and other Dask-related topics will be given in the next part of the exercise.

ndsi = ndsi.persist()

We will mask out the cloudy pixels, which are identified by the values 8 and 9 (cloud medium and high probability) and 3 (cloud shadows) in the scl layer. More detailed info can be found here: https://sentinel.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm-overview

snow = xr.where((ndsi > 0.42) & ~np.isnan(ndsi), 1, ndsi)
snowmap = xr.where((snow <= 0.42) & ~np.isnan(snow), 0, snow)
# keep pixels that are not cloud shadow (3) or cloud medium/high probability (8, 9)
mask = np.logical_not(scl.isin([8, 9, 3]))
snow_cloud = xr.where(mask, snowmap, 2)
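The thresholding and masking above can be traced on a tiny plain-numpy example (demo values only): pixels with NDSI above 0.42 become snow (1), the rest become no-snow (0), and any pixel whose SCL class is cloudy or shadowed is overwritten with 2.

```python
import numpy as np

ndsi_demo = np.array([0.9, 0.1, 0.5, -0.3])
scl_demo = np.array([4, 4, 8, 3])  # 4 = vegetation, 8 = cloud medium prob., 3 = shadow

# snow / no-snow by NDSI threshold
snow_demo = np.where(ndsi_demo > 0.42, 1.0, 0.0)
# True where the pixel is clear (not SCL class 3, 8 or 9)
mask_demo = ~np.isin(scl_demo, [8, 9, 3])
# clear pixels keep their snow value, cloudy ones become 2
snow_cloud_demo = np.where(mask_demo, snow_demo, 2.0)
print(snow_cloud_demo)  # [1. 0. 2. 2.]
```

The third pixel has a snow-like NDSI (0.5), but its SCL class marks it as cloud, so it ends up as 2 rather than 1.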

Mask data#

As we are only interested in the snow cover within the catchment, we will mask out the data outside of it. To achieve this, we need to convert the catchment geometry to the same coordinate reference system as the data, which is UTM zone 32N (EPSG:32632).

aoi_utm32 = aoi.to_crs(epsg=32632)
geom_utm32 = aoi_utm32.iloc[0]['geometry']

As we are going to use the RioXarray library to mask out the data, we need to add some more information to it: the coordinate reference system and the nodata value. Both pieces of information can be found in the metadata, but we need to set them explicitly so that RioXarray can use them.

snow_cloud.rio.write_crs("EPSG:32632", inplace=True)
snow_cloud.rio.set_nodata(np.nan, inplace=True)
<xarray.DataArray 'stackstac-06051b542f3bcca5bf5dc598e529973f' (time: 101,
                                                                y: 1708, x: 1367)>
dask.array<where, shape=(101, 1708, 1367), dtype=float64, chunksize=(1, 1708, 1367), chunktype=numpy.ndarray>
Coordinates: (12/53)
  * time                                     (time) datetime64[ns] 2019-02-01...
    id                                       (time) <U24 'S2B_32TPT_20190201_...
    band                                     <U6 'scl'
  * x                                        (x) float64 6.538e+05 ... 6.811e+05
  * y                                        (y) float64 5.203e+06 ... 5.169e+06
    instruments                              <U3 'msi'
    ...                                       ...
    title                                    <U30 'Scene classification map (...
    common_name                              object None
    center_wavelength                        object None
    full_width_half_max                      object None
    epsg                                     int64 32632
    spatial_ref                              int64 0

The clipping is done by the RioXarray library, which allows manipulating geospatial data in xarray objects. Underneath it uses the rasterio library, which is built on top of GDAL.

snowmap_clipped = snow_cloud.rio.clip([geom_utm32])

It’s time to persist the data in memory. We will use the persist method to load the data in memory and keep it there until the end of the analysis.

clipped_date = snowmap_clipped.persist()

Aggregate data#

Data aggregation is a very important step in the analysis: it reduces the amount of data and makes the analysis more efficient. Moreover, as we are going to aggregate the data to daily values, this will allow us to compute statistics at the basin scale later on. The groupby method allows grouping the data by a specific dimension. In this case we will group by the time dimension, flooring each timestamp to its date and thereby removing the time-of-day information; once the groups are obtained, we will aggregate each group by taking the maximum value.

clipped_date.time
<xarray.DataArray 'time' (time: 101)>
array(['2019-02-01T10:27:41.665000000', '2019-02-01T10:27:55.803000000',
       '2019-02-03T10:17:41.721000000', ...,
       '2019-06-06T10:27:43.405000000', '2019-06-06T10:27:57.521000000'],
      dtype='datetime64[ns]')
Coordinates: (12/51)
  * time                                     (time) datetime64[ns] 2019-02-01...
    id                                       (time) <U24 'S2B_32TPT_20190201_...
    band                                     <U6 'scl'
    instruments                              <U3 'msi'
    proj:epsg                                int64 32632
    s2:nodata_pixel_percentage               (time) object 0.000103 ... 10.65...
    ...                                       ...
    title                                    <U30 'Scene classification map (...
    common_name                              object None
    center_wavelength                        object None
    full_width_half_max                      object None
    epsg                                     int64 32632
    spatial_ref                              int64 0
clipped_date = snowmap_clipped.groupby(snowmap_clipped.time.dt.floor('D')).max(skipna=True)
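The same floor-then-aggregate pattern can be illustrated with a tiny pandas series (pandas stands in for xarray here; the timestamps are made up): the two acquisitions on 2019-02-01 collapse into a single daily value, keeping the maximum.

```python
import pandas as pd

# Two acquisitions on the same day plus one on a later day.
s = pd.Series(
    [0.0, 1.0, 2.0],
    index=pd.to_datetime([
        "2019-02-01 10:27:41",
        "2019-02-01 10:27:55",
        "2019-02-03 10:17:41",
    ]),
)

# Floor each timestamp to midnight, then take the per-day maximum,
# mirroring groupby(time.dt.floor('D')).max(skipna=True) above.
daily_max = s.groupby(s.index.floor("D")).max()
print(daily_max)
```

The resulting series has two entries: 1.0 for 2019-02-01 and 2.0 for 2019-02-03.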

As the data has been aggregated to daily values, the groupby operation has created a new dimension named floor; we rename it to something more meaningful: date.

clipped_date = clipped_date.rename({'floor': 'date'})
clipped_date = clipped_date.persist()

Visualize data#

We will use the hvplot library, which is built on top of holoviews, to visualize the data in the xarray object. As we are going to visualize the data on a map, we need to specify the coordinate reference system of the data, UTM zone 32N (EPSG:32632), so that the library can project the data onto the map. More info on hvplot can be found here: https://hvplot.holoviz.org/

clipped_date.hvplot.image(
    x='x',
    y='y',
    groupby='date',
    crs=pyproj.CRS.from_epsg(32632),
    cmap='Pastel2',
    clim=(-1, 2),
    frame_width=500,
    frame_height=500,
    title='Snowmap',
    geo=True, tiles='OSM')

Compute statistics#

From the original notebook: Calculate Catchment Statistics. We are looking at a region over time, and we need to make sure that the information content meets our expected quality. Therefore, we calculate the cloud percentage for the catchment for each timestep and use this information to filter the timeseries: all timesteps with a cloud coverage of over 25% will be discarded.

Ultimately we are interested in the snow covered area (SCA) within the catchment. We count all snow covered pixels within the catchment for each time step; multiplied by the pixel area, that would be the snow covered area. Dividing the snow pixel count by the total number of pixels in the catchment gives the percentage of pixels covered with snow, which is the number we will use.

Get number of pixels in catchment: total, clouds, snow.

# number of cloud pixels
cloud = xr.where(clipped_date == 2, 1, np.nan).count(dim=['x', 'y']).persist()
# number of all pixels per each single date
aot_total = clipped_date.count(dim=['x', 'y']).persist()
# Cloud fraction per each single date expressed in % 
cloud_fraction = (cloud / aot_total * 100).persist()
# Visualize cloud fraction
cloud_fraction.hvplot.line(title='Cloud cover %', ylabel="%") * hv.HLine(25).opts(
    color='red',
    line_dash='dashed',
    line_width=2.0,
)
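The counting logic behind these statistics can be checked on a toy classified array (demo values; NaN marks pixels clipped away outside the catchment): count valid pixels, count cloud pixels, and express the latter as a percentage of the former.

```python
import numpy as np

# Toy 2x3 classified raster: 2 = cloud, 1 = snow, 0 = no snow, nan = outside catchment.
classified = np.array([[2.0, 2.0, 1.0],
                       [0.0, 1.0, np.nan]])

total = np.count_nonzero(~np.isnan(classified))   # valid pixels in the catchment
clouds = np.count_nonzero(classified == 2)        # cloud pixels
cloud_pct = clouds / total * 100
print(cloud_pct)  # 40.0
```

With 2 cloud pixels out of 5 valid ones, the cloud fraction is 40%, which would exceed the 25% threshold and get this timestep discarded.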

We are going to get the same information for the snow cover.

snow = xr.where(clipped_date == 1, 1, np.nan).count(dim=['x', 'y']).persist()
snow_fraction = (snow / aot_total * 100).persist()
# visualize snow fraction
snow_fraction.hvplot.line(title='Snow cover area (%)', ylabel="%")
# discard timesteps with cloud fraction over 25%
masked_cloud_fraction = cloud_fraction < 25
snow_selected = snow_fraction.sel(date=masked_cloud_fraction)
snow_selected.name = 'SCA'
snow_selected.hvplot.line(title="Snow fraction")
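Filtering one timeseries by a threshold on another, as done here with the snow and cloud fractions, boils down to boolean selection; a pandas stand-in with made-up values:

```python
import pandas as pd

dates = pd.to_datetime(["2019-02-01", "2019-02-03", "2019-02-06"])
snow_frac = pd.Series([80.0, 60.0, 40.0], index=dates)
cloud_frac = pd.Series([10.0, 50.0, 20.0], index=dates)

# keep only the dates where the cloud fraction is acceptable
keep = cloud_frac < 25
snow_sel = snow_frac[keep]
print(list(snow_sel))  # [80.0, 40.0]
```

The middle date (50% clouds) is dropped, so its possibly unreliable snow estimate never enters the downstream statistics.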

Let’s compare the snow cover data with the discharge data.

discharge_ds = pd.read_csv('../data/ADO_DSC_ITH1_0025.csv', sep=',', index_col='Time', parse_dates=True)

Let’s refine the data a little so that we can compare it with the snow cover data.

start_date = pd.to_datetime("2019/02/01")
end_date = pd.to_datetime("2019/06/30")
# filter discharge data to start and end dates
discharge_ds = discharge_ds.loc[start_date:end_date]
discharge_ds.discharge_m3_s.hvplot(title='Discharge volume', ylabel='Discharge (m$^3$/s)') * snow_selected.hvplot(ylabel='Snow cover area (%)')  
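The `.loc[start_date:end_date]` step above relies on pandas' label-based datetime slicing, which is inclusive on both ends; a minimal sketch with a made-up monthly series:

```python
import pandas as pd

# Six monthly timestamps, Jan through Jun 2019.
idx = pd.date_range("2019-01-01", periods=6, freq="MS")
ts = pd.Series(range(6), index=idx)

# Label-based slicing on a DatetimeIndex: both endpoints are inclusive.
window = ts.loc[pd.to_datetime("2019/02/01"):pd.to_datetime("2019/06/30")]
print(len(window))  # 5 (Feb, Mar, Apr, May, Jun)
```

Unlike positional slicing, this works even when the exact end date is absent from the index: every timestamp falling inside the interval is kept.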

Conclusion#

In this analysis, we have comprehensively examined the features, capabilities, and limitations of two prominent geospatial data processing frameworks: OpenEO and Pangeo. OpenEO offers a unified API that simplifies the process of accessing and processing earth observation data across various backends, allowing users to interact with different data sources seamlessly. Its standardized interface is a strong asset, making it accessible to a wide range of users, from researchers to application developers.

On the other hand, Pangeo excels in facilitating big data geoscience. Its robust ecosystem, built around existing Python libraries like Dask and Xarray, makes it a powerful tool for large-scale data analysis and visualization. Pangeo’s community-driven approach and open-source nature foster collaboration and innovation, promoting a dynamic and adaptable framework.

Each platform has its own set of advantages and constraints. OpenEO simplifies interoperability and enhances accessibility, making it particularly beneficial for users who wish to work with diverse data sources without delving deeply into the complexities of each backend. Pangeo, with its emphasis on leveraging existing Python tools and libraries, is particularly potent for those comfortable with Python and seeking to perform extensive, scalable analyses.

Choosing between OpenEO and Pangeo ultimately depends on the specific requirements and constraints of a project. Considerations such as the user’s familiarity with Python, the necessity for interoperability across various data backends, and the scale of data processing required should guide the decision-making process.